Automatic Dictionary Organization In NLP Systems For Oriental Languages

Authors

  • V. Andrezen
  • L. Kogan
  • W. Kwitakowski
  • R. Minvaleev
  • R. Piotrowski
  • V. Shumovsky
  • E. Tioun
  • Yu. Tovmach
Abstract

This paper presents a description of automatic dictionaries (ADs) and dictionary entry (DE) schemes for NLP systems dealing with Oriental languages. The uniformity of the AD organization and of the DE pattern does not prevent the system from taking into account the structural differences of isolating (analytical), agglutinating and internal-flection languages.

The "Speech Statistics" (SpSt) project team has been designing a linguistic automaton aimed at NL processing in a variety of forms. In addition to Germanic and Romance languages, the system under development is to handle text processing of a number of Oriental languages. The strategy adopted by the SpSt group is characterized by a lexicalized approach: the NLP algorithms for any language are entirely AD-dependent, i.e., a large lexicon database has been provided, its entries being loaded with information including not only lexical but also morphological, syntactic and semantic data. This information, concentrated in dictionary entries (DEs), is essential for both source text analysis and target (Russian) text generation.

The DE structure is largely determined by the typological features of the source language. The SpSt group has hitherto had to deal with European languages, and it was for these languages (inflective and inflective-analytical) that the prototype entry schemes were elaborated and adopted. The typological characteristics of Oriental languages required certain modifications to be made to the basic scheme. Hence in the present paper each of the language types is given consideration. Agglutinating languages proved to be the most suitable to process according to the SpSt strategy, but an isolating language will be the first to be proposed for discussion.

1. The AD organization for an isolating language: Chinese

For the purposes of NLP it is plausible to treat written Chinese as an exclusively isolating language in which affixation is virtually non-existent.
The few inflective word-forms are entered into the lexicon as unanalyzable lexical items, whereas multiple grammar formants are treated as free structural elements. A high degree of lexical ambiguity, which makes syntactic disambiguation a must, and the fact that word boundaries are not explicitly marked in the text are well-known problems with Chinese text analysis. (Actually, in the MULTIS project elaborated by the SpSt group, Chinese characters are transformed into 4-digit strings in conformity with the Chinese Standard Telegraph Code.) Thus grammatical and logico-semantic relations in the text are expressed by word order, structural words and semantic valencies. In addition to their role as labels for syntactic units (predicate, direct and indirect objects, etc.), the structural words function as delimiters singling out word-forms and phrases. A separate sub-lexicon for structural words is accordingly provided within the whole lexicon database of Chinese as a source language.

Actes de COLING-92, Nantes, 23-28 août 1992 505 Proc. of COLING-92, Nantes, Aug. 23-28, 1992

The file of notional words comprises lexical items of various lengths, ranging from one-character items to eight-character ones, no differentiation being made among one-stem words, composite words and phrases. A distinct version of the DE scheme is assigned to each of the two classes of lexical items: notional words (N/W) and structural words (S/W). The DE scheme for N/W includes, along with syntactic and semantic data, the following: 1) part-of-speech assignment; 2) information on the lexical ambiguity. Thus, by way of example, the one-stem word sudan 'sultan' and the composite beida 'Beijing University' are coded N00, where N denotes noun, while qianding 'to sign a treaty' is coded 0S0, where S denotes verb/noun lexical ambiguity (to be eventually disambiguated by syntactic means).
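The N/W coding just described can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the SpSt implementation: the class and field names are hypothetical, the lexicon is keyed by romanized forms for readability (the text says MULTIS works on 4-digit telegraph-code strings), and reading the first code position as part of speech and the second as the ambiguity class is only inferred from the two examples N00 and 0S0. The third position is left uninterpreted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NotionalEntry:
    """Hypothetical dictionary entry (DE) for a notional word (N/W)."""
    form: str   # romanized lexical item (assumption: real keys are 4-digit codes)
    gloss: str  # English gloss quoted in the text
    code: str   # three-position grammatical code, e.g. "N00" or "0S0"

    @property
    def part_of_speech(self) -> Optional[str]:
        # Assumed reading: position 1 holds the part of speech, "0" means unset.
        return None if self.code[0] == "0" else self.code[0]

    @property
    def ambiguity(self) -> Optional[str]:
        # Assumed reading: position 2 holds the lexical-ambiguity class.
        return None if self.code[1] == "0" else self.code[1]

    @property
    def needs_syntactic_disambiguation(self) -> bool:
        return self.ambiguity is not None


# The three items coded in the text; no distinction is made between
# one-stem words, composite words and phrases.
LEXICON = {
    entry.form: entry
    for entry in (
        NotionalEntry("sudan", "sultan", "N00"),
        NotionalEntry("beida", "Beijing University", "N00"),
        NotionalEntry("qianding", "to sign a treaty", "0S0"),
    )
}
```

Under this reading, an entry such as qianding carries no fixed part of speech and is flagged for later syntactic disambiguation, which is the behavior the text describes for the S code.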
As to the DE schemes for S/Ws, each of these should include positional characteristics of the lexical item and provide information on the way the given particle affects formation of the Russian equivalent. E.g., in the grammatical coding of the verbal aspect S/W le and the nominal S/Ws de and ba the following points are marked: 1) part of speech dependence; 2) position (pre- or post-position with respect to the N/W); 3) Russian matching; 4) syntactic function. The information placed in a DE may be used in translating sentences as illustrated below:

  Sudan ba heyue qianding le
  The sultan the peace treaty signed

In carrying out the lexico-syntactical analysis of this sentence two word groups are delimited: the nominal group ba heyue and the verbal group qianding le. In the ba-DE there are data to define ba as an S/W in pre-position to a direct object which is equivalent to a Russian noun in the Accusative Case. In the le-DE there are data to define le as a verbal index in post-position to a verbal predicate, indicating the completion of an action and equivalent to a Russian verb in the Past Tense, Perfective. (For the sake of simplicity the polyvalent and polysemantic nature of these particles is ignored in this example.)

2. The AD organization for an agglutinating language: Turkish

The agglutinative word-formation technique is characterized by ordered addition of affixes to the stem to produce formant strings of various lengths. An outstanding feature of agglutinating languages is that these word-forms are not reproduced ready-made in speech but are constructed by the speaker actually 'ad hoc' according to definite rules. Each of the limited set of affixes imparts 'a semantic quant' or represents a grammatical category, e.g., see the following patterns where the stem 'sultan' and some of its derivatives are...
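Returning to the Chinese example of section 1, the four points marked in an S/W entry and their use in delimiting word groups can be sketched as follows. This is a simplified illustration under stated assumptions: the names StructuralEntry, SW_LEXICON and delimit_groups are hypothetical, the input is assumed to be already segmented into romanized words, and each particle is given a single reading, as in the text's own simplification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StructuralEntry:
    """Hypothetical dictionary entry (DE) for a structural word (S/W)."""
    form: str
    attaches_to: str         # 1) part of speech dependence
    position: str            # 2) "pre" or "post" with respect to the N/W
    russian_matching: str    # 3) features of the Russian equivalent
    syntactic_function: str  # 4) syntactic function of the word group


# Only the two particles whose entries are described in the text.
SW_LEXICON = {
    "ba": StructuralEntry("ba", "noun", "pre",
                          "noun, Accusative Case", "direct object"),
    "le": StructuralEntry("le", "verb", "post",
                          "verb, Past Tense, Perfective", "verbal predicate"),
}


def delimit_groups(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Pair each structural word with its adjacent notional word.

    Returns (syntactic function, notional word, Russian matching) triples.
    """
    groups = []
    for i, token in enumerate(tokens):
        entry = SW_LEXICON.get(token)
        if entry is None:
            continue  # notional word: handled by the N/W lexicon
        j = i + 1 if entry.position == "pre" else i - 1
        if 0 <= j < len(tokens):
            groups.append(
                (entry.syntactic_function, tokens[j], entry.russian_matching)
            )
    return groups


# Assumption: the sentence has already been split into words.
groups = delimit_groups(["sudan", "ba", "heyue", "qianding", "le"])
```

For the sample sentence, groups pairs heyue with the direct-object reading (Russian Accusative) and qianding with the verbal-predicate reading (Russian Past Tense, Perfective), mirroring the nominal group ba heyue and the verbal group qianding le.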


Similar resources

Dictionary Organization in Linguistic Automaton for Oriental Languages

The central problem for natural language processing (NLP) systems dealing with non-Indo-European (“Oriental”) languages is how to develop automatic dictionaries (AD) and dictionary entry (DE) schemes. The point is that the need of Oriental language industrial NLP has been felt for some time. It has acquired additional urgency with the rapid growth of business contacts between Russia and the nat...


Use of NLP Tools in CALL System for Arabic

This article focuses on the development of Natural Language Processing (NLP) tools for Computer Assisted Language Learning (CALL). First, we have developed some NLP tools: a labelled dictionary of Arabic (as complete as possible), a generator for morphological derivatives, a Conjugator and a morphological analyzer for Arabic. Second, we used these tools to create a number of educational applica...


ITRI-00-15 Business Models for Dictionaries and NLP

NLP needs dictionaries, and dictionary-makers can use NLP to make better dictionaries, so there is great potential for synergy between the two activities. To date, there has been only very limited collaboration. The two reasons for this are (a) dictionary publishers’ concerns regarding intellectual property, and (b) the different languages that lexicographers and NLP researchers speak. In this pa...


Named Entity Recognition and Classification in Kannada Language

Named Entity Recognition and Classification (NERC) is an essential and challenging task in Natural Language Processing (NLP). Kannada is a highly inflectional and agglutinating language providing one of the richest and most challenging sets of linguistic and statistical features, resulting in long and complex word forms, which are large in number. It is primarily a suffixing language, and an inflected word starts with a root an...


Semantic Classification for Practical Natural Language Processing

In the field of natural language processing (NLP) there is now a consensus that all NLP systems that seek to represent and manipulate meanings of texts need an ontology, that is, a taxonomic classification of concepts in the world to be used as semantic primitives. In our continued efforts to build a multilingual knowledge-based machine translation (KBMT) system using an interlingual meaning represe...




Publication date: 1992